ABSTRACT

In this paper, we present a Convolutional Neural Network (CNN)
regression approach for real-time 2-D/3-D registration. Unlike
optimization-based methods, which iteratively optimize the
transformation parameters over a scalar-valued metric function
representing the quality of the registration, the proposed method
exploits the information embedded in the appearances of the Digitally
Reconstructed Radiograph (DRR) and X-ray images, and employs CNN
regressors to directly estimate the transformation parameters. The CNN
regressors are trained for local zones and applied in a hierarchical
manner to break down the complex regression task into simpler
sub-tasks that can be learned separately. Our experimental results
demonstrate the advantage of the proposed method in computational
efficiency, with negligible degradation of registration accuracy
compared to intensity-based methods.

Index Terms: 2-D/3-D Registration, Image Guided Intervention,
Convolutional Neural Network, Deep Learning

1. INTRODUCTION 

2-D/3-D registration is one of the key enabling technologies in
medical imaging and image-guided interventions. It brings the
pre-operative 3-D data and intra-operative 2-D data into the same
coordinate system, to facilitate accurate diagnosis and/or provide
advanced image guidance. The pre-operative 3-D data generally includes
Computed Tomography (CT), Cone-beam CT (CBCT), Magnetic Resonance
Imaging (MRI) and Computer Aided Design (CAD) models of medical
devices, while the intra-operative 2-D data is predominantly X-ray
images. In this paper, we focus on registering a 3-D X-ray attenuation
map provided by CT or CBCT with a 2-D X-ray image in real-time.

Although 2-D/3-D registration is a widely adopted technology in
medical imaging, real-time 2-D/3-D registration with sub-millimeter
accuracy remains a great challenge. Most existing 2-D/3-D registration
methods in the literature are optimization-based, in which the
transformation parameters are iteratively updated to optimize an
objective function reflecting the quality of the registration.
Depending on the objective function to be optimized,
optimization-based methods can be further divided into intensity-based
and feature-based methods. In intensity-based methods, a simulated
X-ray image, referred to as a Digitally Reconstructed Radiograph
(DRR), is derived from the 3-D X-ray attenuation map by simulating the
attenuation of virtual X-rays. An optimizer is employed to maximize an
intensity-based similarity measure between the DRR and X-ray images.
Intensity-based methods are widely adopted mainly because of their
high accuracy. However, they often involve a large number of
evaluations of the similarity measure, each requiring a high
computational cost in rendering the DRR, and as a result are typically
not suitable for real-time applications. In comparison, feature-based
methods calculate similarity measures efficiently from geometric
features extracted from the images, e.g., corners, lines and
segmentations, and therefore have a higher computational efficiency
than intensity-based methods. One potential drawback of feature-based
methods is that they rely heavily on accurate detection of geometric
features, which by itself can be a challenging task. Errors from the
feature detection step are inevitably propagated into the registration
result, making feature-based methods in general less accurate [9].

In this paper, a Convolutional Neural Network (CNN) regression
approach is presented for real-time 2-D/3-D registration. The
effectiveness of CNNs has been shown in a wide range of computer
vision tasks [10], but to the best of the authors' knowledge, it has
not been reported for 2-D/3-D registration. We rely on the strong
non-linear modeling capability of CNNs to directly estimate the
transformation parameters from the appearance of the DRR and X-ray
images. Compared with intensity-based methods, which map the images to
a scalar-valued metric function, the proposed method better exploits
the information embedded in the images for more efficient parameter
updates. Therefore, accurate 2-D/3-D registration can be achieved with
very few DRR renderings, making the proposed method highly
computationally efficient and suitable for real-time applications.

2. PROBLEM FORMULATION 

2.1. 3-D Transformation Parameterization 

A rigid-body 3-D transformation $T$ can be parameterized by a vector
$t$ with 6 components. In our approach, we parameterize the
transformation by 3 in-plane and 3 out-of-plane transformation
parameters, as shown in Fig. 1. In particular, in-plane transformation
parameters include 2 translation parameters, $t_x$ and $t_y$, and 1
rotation parameter, $t_\theta$. The effects of in-plane transformation
parameters are approximately 2-D rigid-body transformations.
Out-of-plane transformation parameters include 1 out-of-plane
translation parameter, $t_z$, and 2 out-of-plane rotation parameters,
$t_\alpha$ and $t_\beta$. The effects of out-of-plane translation and
rotations are scaling and shape changes, respectively.

Fig. 1: Effects of the 6 transformation parameters


2.2. 2-D/3-D Registration via Regression 

We denote the X-ray image with transformation parameters $t$ as $I_t$.
The inputs for 2-D/3-D registration are: 1) a 3-D object described by
its X-ray attenuation map $J$, 2) an X-ray image $I_{t_{gt}}$, where
$t_{gt}$ denotes the unknown ground truth transformation parameters,
and 3) initial transformation parameters $t_{ini}$. The goal of
2-D/3-D registration is to estimate $t_{gt}$ from the inputs. It can
be formulated as a regression problem, where a set of regressors
$f(\cdot)$ are trained to reveal the mapping from a feature
$X(t_{ini}, I_{t_{gt}})$ extracted from the inputs to the difference
between $t_{ini}$ and $t_{gt}$, as long as it is within a pre-defined
range $\epsilon$:

$$t_{gt} - t_{ini} \approx f(X(t_{ini}, I_{t_{gt}})), \quad \forall\, t_{gt} - t_{ini} \in \epsilon. \tag{1}$$

An estimate of $t_{gt}$ is then obtained by applying the regressors
and incorporating the result into $t_{ini}$:

$$\hat{t}_{gt} = t_{ini} + f(X(t_{ini}, I_{t_{gt}})). \tag{2}$$

It is worth noting that the range $\epsilon$ in Eqn. (1) is equivalent
to the capture range of optimization-based registration methods. Based
on Eqn. (1), our problem formulation can be expressed as designing a
feature extractor $X(\cdot)$ and training regressors $f(\cdot)$, such
that

$$\delta t \approx f(X(t, I_{t+\delta t})), \quad \forall\, \delta t \in \epsilon. \tag{3}$$

In the next section, we discuss in detail 1) how the feature
$X(t, I_{t+\delta t})$ is calculated and 2) how the regressors
$f(\cdot)$ are designed, trained and applied.
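The update in Eqn. (2) can be sketched in a few lines of Python. This is only an illustrative sketch: `register_once`, `in_capture_range`, `toy_extract_feature` and `toy_regress` are hypothetical names, and the toy feature and regressor are stand-ins for $X(\cdot)$ and $f(\cdot)$, not the CNN regressors described in this paper.

```python
from typing import Callable, List, Sequence


def in_capture_range(delta: Sequence[float], epsilon: Sequence[float]) -> bool:
    """Check that every component of delta lies within the range epsilon."""
    return all(abs(d) <= e for d, e in zip(delta, epsilon))


def register_once(
    t_ini: Sequence[float],
    xray,
    extract_feature: Callable,
    regress: Callable,
) -> List[float]:
    """Estimate t_gt as t_ini + f(X(t_ini, I_tgt)), following Eqn. (2)."""
    feature = extract_feature(t_ini, xray)
    correction = regress(feature)
    return [t + d for t, d in zip(t_ini, correction)]


# Toy stand-ins (assumptions for illustration only): the "X-ray" is
# represented directly by its hidden parameters, the feature is the
# parameter difference, and the regressor is the identity.
def toy_extract_feature(t, xray):
    return [g - p for g, p in zip(xray, t)]


def toy_regress(feature):
    return list(feature)


if __name__ == "__main__":
    t_gt = [1.0, -2.0, 500.0, 3.0, 10.0, -5.0]
    t_ini = [0.0, 0.0, 490.0, 0.0, 0.0, 0.0]
    print(register_once(t_ini, t_gt, toy_extract_feature, toy_regress))
```

With a perfect regressor, a single application recovers the ground truth; with a real, imperfect regressor the estimate is only approximate, which is why the range check matters.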



Fig. 2: Feature extraction from the DRR and X-ray images 

3. METHOD 
3.1. Feature Extraction 

We compute the residual between the DRR with transformation parameters
$t$, denoted by $I_t$, and the X-ray image $I_{t+\delta t}$, and use
it as the feature for regression. The residual is computed within an
ROI around the target object in the DRR, determined by $t$, as shown
in Fig. 2. An ROI can be described by $(q, w, h, \phi)$, denoting the
ROI's center, width, height and orientation, respectively. The center
$q$ is the 2-D projection of the gravity center of the target object
using transformation parameters $t$. The width and height are
calculated as $w = w_0 \cdot D / t_z$ and $h = h_0 \cdot D / t_z$,
respectively, where $w_0$ and $h_0$ are the size of the ROI in mm and
$D$ is the distance between the X-ray source and detector. The
orientation $\phi = t_\theta$, so that it is always aligned with the
object. We define an operator $H_t(\cdot)$ that extracts the image
patch in the ROI determined by $t$ and re-samples it to a fixed size
(156×300 in our experiment). The feature used for regression is then
calculated as

$$X(t, I_{t+\delta t}) = H_t(I_t) - H_t(I_{t+\delta t}). \tag{4}$$
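As a small worked example of the ROI scaling and the residual feature above, the following Python sketch computes the ROI size from $w_0$, $h_0$, $D$ and $t_z$, and subtracts two already-resampled patches. The function names and the numeric values are assumptions chosen for illustration; patch extraction and re-sampling are not shown.

```python
def roi_size(w0_mm, h0_mm, source_detector_mm, t_z_mm):
    """Scale the nominal ROI size by D / t_z (w = w0 * D / t_z, h = h0 * D / t_z)."""
    scale = source_detector_mm / t_z_mm
    return w0_mm * scale, h0_mm * scale


def residual_feature(drr_patch, xray_patch):
    """Pixel-wise difference of two equally sized patches (lists of rows)."""
    if len(drr_patch) != len(xray_patch):
        raise ValueError("patches must have the same number of rows")
    return [
        [a - b for a, b in zip(row_drr, row_xray)]
        for row_drr, row_xray in zip(drr_patch, xray_patch)
    ]


if __name__ == "__main__":
    # Hypothetical geometry: the object sits halfway between source and detector,
    # so the ROI is magnified by a factor of 2.
    print(roi_size(w0_mm=40.0, h0_mm=80.0, source_detector_mm=1200.0, t_z_mm=600.0))
    print(residual_feature([[3, 5], [7, 9]], [[1, 2], [3, 4]]))
```

Because the ROI follows $t$, an object that moves closer to the source (smaller $t_z$) gets a proportionally larger ROI, so the resampled patch keeps the object at a roughly constant scale.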


3.2. Hierarchical Regression 

Our goal is to train 6 regressors
$f = \{f_x, f_y, f_z, f_\theta, f_\alpha, f_\beta\}$ to reveal the
correlation between $X$ and $\delta t$. Considering that $X$ only
contains 2-D information, the mapping from $X$ to $\delta t$ could be
very complex. To reduce the complexity of the regression problems, we
carry out the following hierarchical regression steps. The steps are
also illustrated in the workflow diagram shown in Fig. 3.

Fig. 3: Workflow of the hierarchical regression strategy

Fig. 4: Structure of the multi-task learning convolutional neural
network.

We first partition the parameter space spanned by $t_\alpha$ and
$t_\beta$ with an 18×18 grid, each cell covering a 20°×20° zone. Six
regressors are trained for each individual zone to solve 2-D/3-D
registration problems with initial $t_\alpha$ and $t_\beta$ in this
zone. 2-D/3-D registration tasks are dispatched to the corresponding
zones according to their initial values of $t_\alpha$ and $t_\beta$.
Using this strategy, each regressor only needs to reveal the
correlation between $X$ and $\delta t$ for a small range of $t_\alpha$
and $t_\beta$ (i.e., 20°), making the regression problems much
simpler.
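The zone dispatch described above amounts to a simple grid lookup. The sketch below assumes that the 18×18 grid covers [0°, 360°) along each rotation axis with the first zone starting at 0°; the grid origin is not stated here, so that choice and the function name `zone_index` are assumptions.

```python
def zone_index(t_alpha_deg, t_beta_deg, zone_deg=20.0, grid=18):
    """Return the (alpha, beta) zone indices for the initial rotations.

    Angles are wrapped into [0, 360) before binning into 20-degree zones.
    """
    def bin_of(angle_deg):
        wrapped = angle_deg % (zone_deg * grid)
        # Guard against floating-point wrap-around landing exactly on 360.
        return min(int(wrapped // zone_deg), grid - 1)

    return bin_of(t_alpha_deg), bin_of(t_beta_deg)


if __name__ == "__main__":
    print(zone_index(25.0, 359.9))
    print(zone_index(-10.0, 40.0))
```

The returned pair can serve as a key into a table of per-zone regressors, so that each registration task only ever touches the models trained for its own zone.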

We then divide the 6 regressors into 3 groups,
$\{f_x, f_y, f_\theta\}$, $\{f_\alpha, f_\beta\}$ and $\{f_z\}$, and
regress them hierarchically. Among the 3 groups, the parameters in
Group 1 are considered the easiest to estimate, because they cause
simple and dominant rigid-body 2-D transformations of the object in
the projection image and are less affected by the variations of other
parameters. The parameter in Group 3 is the most difficult to
estimate, because it only causes subtle scaling of the object in the
projection image. The difficulty of estimating the parameters in Group
2 falls in between. Therefore, we regress the 3 groups of parameters
sequentially, from the easiest group to the most difficult one. After
a group of parameters is regressed, the feature
$X(t, I_{t+\delta t})$ is re-calculated using the already-estimated
parameters for the regression of the parameters in the next group.
This way, the regression for the current group becomes less
complicated, because the confounding factors coming from the
parameters in the previous groups are removed.

The above hierarchical regressors can be applied once (single-pass
mode) or multiple times (multi-pass mode). The multi-pass mode repeats
the regression process for multiple iterations, with the result of the
previous iteration used as the starting position for the current
iteration.
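The group-by-group regression and the multi-pass mode can be sketched as two nested loops. In this Python sketch, the feature extractor and the per-group regressors are hypothetical toy stand-ins (each toy regressor recovers only half of the remaining error), and zone dispatch is omitted for brevity.

```python
GROUPS = (("x", "y", "theta"), ("alpha", "beta"), ("z",))


def hierarchical_pass(t, xray, extract_feature, regressors):
    """Regress the three parameter groups in order, easiest first."""
    t = dict(t)
    for group in GROUPS:
        # The feature is re-computed with the parameters estimated so far.
        feature = extract_feature(t, xray)
        correction = regressors[group](feature)
        for name in group:
            t[name] += correction[name]
    return t


def register(t_ini, xray, extract_feature, regressors, passes=1):
    """Single-pass mode for passes=1, multi-pass mode otherwise."""
    t = dict(t_ini)
    for _ in range(passes):
        t = hierarchical_pass(t, xray, extract_feature, regressors)
    return t


# Toy stand-ins (assumptions): the "X-ray" is its hidden parameter dict,
# the feature is the remaining error, and each regressor removes half of it.
def toy_feature(t, xray):
    return {name: xray[name] - t[name] for name in t}


def make_toy_regressor(group):
    return lambda feature: {name: 0.5 * feature[name] for name in group}


if __name__ == "__main__":
    names = ("x", "y", "z", "theta", "alpha", "beta")
    t_gt = {name: 8.0 for name in names}
    t_ini = {name: 0.0 for name in names}
    regressors = {group: make_toy_regressor(group) for group in GROUPS}
    for passes in (1, 2, 3):
        print(passes, register(t_ini, t_gt, toy_feature, regressors, passes))
```

With imperfect regressors, each additional pass starts from a better position than the last, which is the motivation for the multi-pass mode.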

3.3. Convolutional Neural Network for Regression 

3.3.1. Network Structure 

One CNN regression model with the architecture shown in Fig. 4 is
trained for each group in each zone. The input of the CNN regression
model is a 156×300 image, computed following Eqn. (4). The CNN
consists of five layers, including two 5×5 convolutional layers (C1
and C2), each followed by a 2×2 max-pooling layer (P1 and P2) with a
stride of 2, and a fully-connected layer (F1) with 250 Rectified
Linear Unit (ReLU) neurons. The output layer (F2) is fully connected
to F1, with each output node corresponding to one parameter in the
group.
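To make the layer description concrete, the following Python sketch traces the spatial size of the feature maps through C1, P1, C2 and P2 for a 156×300 input. It assumes unpadded convolutions with a stride of 1, which is not stated above; the number of filters per layer is also not given, so only spatial sizes are computed.

```python
def output_size(size, kernel, stride=1, padding=0):
    """Standard output-size formula for a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def spatial_sizes(input_size=(156, 300)):
    """Trace the feature-map size through C1, P1, C2 and P2."""
    # (layer name, kernel size, stride); convolution stride of 1 is an assumption.
    layers = [("C1", 5, 1), ("P1", 2, 2), ("C2", 5, 1), ("P2", 2, 2)]
    sizes = [("input", tuple(input_size))]
    current = tuple(input_size)
    for name, kernel, stride in layers:
        current = tuple(output_size(s, kernel, stride) for s in current)
        sizes.append((name, current))
    return sizes


if __name__ == "__main__":
    for name, size in spatial_sizes():
        print(name, size)
```

Under these assumptions, the maps entering the fully-connected layer F1 are 36×72 per channel.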


3.3.2. Training 

The CNN regression models are trained exclusively on synthetic X-ray
images, because they provide reliable ground truth labels with little
need for laborious manual annotation, and the number of real X-ray
images could be limited. For each group in each zone, we randomly
generate 25,000 pairs of $t$ and $\delta t$. The parameters $t$ follow
a uniform distribution with $t_\alpha$ and $t_\beta$ constrained in
the zone. The parameter errors $\delta t$ for Group 1 follow a
zero-mean uniform distribution over ranges of 3.0 mm, 3.0 mm, 30.0 mm,
6°, 30° and 30°. The uniform distribution ranges of $\delta t_x$,
$\delta t_y$ and $\delta t_\theta$ are reduced for Group 2 to 0.4 mm,
0.4 mm and 1.0°, because they are close to zero after the regressors
in Group 1 are applied. For the same reason, the distribution ranges
of $\delta t_\alpha$ and $\delta t_\beta$ are reduced for Group 3 to
1.5° and 1.5°. For each pair of $t$ and $\delta t$, a synthetic X-ray
image $I_{t+\delta t}$ is generated and the feature
$X(t, I_{t+\delta t})$ is calculated following Eqn. (4).
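The sampling of parameter errors for the three groups can be sketched as follows. Two points are assumptions rather than statements from the text: each listed range r is interpreted as the symmetric interval [-r, r], and the reductions for Group 2 are assumed to carry over to Group 3. Rendering of the synthetic X-ray image itself is not shown.

```python
import random

# Ranges for Group 1: mm for translations, degrees for rotations.
GROUP1_RANGES = {
    "x": 3.0, "y": 3.0, "z": 30.0,
    "theta": 6.0, "alpha": 30.0, "beta": 30.0,
}


def ranges_for_group(group):
    """Return the per-parameter error ranges used for Group 1, 2 or 3."""
    ranges = dict(GROUP1_RANGES)
    if group >= 2:
        # In-plane errors are small once the Group 1 regressors have run.
        ranges.update(x=0.4, y=0.4, theta=1.0)
    if group >= 3:
        # Out-of-plane rotation errors are small after Group 2.
        ranges.update(alpha=1.5, beta=1.5)
    return ranges


def sample_delta(group, rng):
    """Draw one zero-mean uniform parameter error (assumed interval [-r, r])."""
    return {
        name: rng.uniform(-r, r)
        for name, r in ranges_for_group(group).items()
    }


if __name__ == "__main__":
    rng = random.Random(0)
    for group in (1, 2, 3):
        print(group, sample_delta(group, rng))
```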

The objective function to be minimized during training is defined as:

$$\Phi = \frac{1}{K} \sum_{i=1}^{K} \left\| y_i - f(X_i; W) \right\|_2^2, \tag{5}$$

where $K$ is the number of training samples, $y_i$ is the label for
the $i$-th training sample, $W$ is a vector of weights to be learned,
and $f(X_i; W)$ is the output of the regression model parameterized by
$W$ on the $i$-th training sample. The weights $W$ are learned using
Stochastic Gradient Descent (SGD), with a batch size of 64, momentum
of $m = 0.9$ and weight decay of $d = 0.0001$. The learning rate
$\kappa_i$ is decayed in each iteration following
$\kappa_i = 0.0025 \cdot (1 + 0.0001 \cdot$

The weights are initialized using the Xavier method, and mini-batch
SGD is performed for 32 epochs.
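For readers unfamiliar with the optimizer settings above, the sketch below shows one common formulation of an SGD step with momentum and L2 weight decay, using $m = 0.9$ and $d = 0.0001$. The exact update rule of the training framework is not specified here, so this formulation, the function name and the constant learning rate used in the example are assumptions.

```python
def sgd_momentum_step(weights, grads, velocity, learning_rate,
                      momentum=0.9, weight_decay=0.0001):
    """One SGD update with momentum and L2 weight decay.

    v <- m * v - lr * (g + d * w)
    w <- w + v
    """
    new_velocity = [
        momentum * v - learning_rate * (g + weight_decay * w)
        for w, g, v in zip(weights, grads, velocity)
    ]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


if __name__ == "__main__":
    # Toy problem: minimize (w - 3)^2 with a constant learning rate.
    w, v = [0.0], [0.0]
    for _ in range(2000):
        grad = [2.0 * (w[0] - 3.0)]
        w, v = sgd_momentum_step(w, grad, v, learning_rate=0.0025)
    print(w)
```

Weight decay pulls the solution very slightly toward zero, so the toy example converges close to, but not exactly at, 3.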

4. EXPERIMENTS AND RESULTS 
4.1. Experiment Setup 

We conducted experiments on a dataset from a potential application,
Virtual Implant Planning System (VIPS), which is an intraoperative
application to facilitate the planning of implant placement in terms
of orientation, angulation and length of the screws. In VIPS, 2-D/3-D
registration can be performed to match the 3-D virtual implant with
the fluoroscopic image of the real implant. The dataset consists of a
CAD model of a volar plate and 7 X-ray images of the volar plate
implant mounted onto a phantom model of the distal radius. The size of
the X-ray images is 1024×1024 with a pixel spacing of 0.223 mm. The
3-D CAD model was converted to a binary volume using the marching
cubes algorithm for registration. Ground truth transformation
parameters used for quantifying registration error were generated by
first manually registering the target object and then applying an
intensity-based 2-D/3-D registration method using Powell's method
combined with Gradient Correlation (GC). For each X-ray image, 140
perturbations of the ground truth were generated as starting positions
for 2-D/3-D registration. The perturbations followed a zero-mean
Gaussian distribution with standard deviations of 1.0 mm, 1.0 mm,
10.0 mm, 2°, 10° and 10°.
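The generation of starting positions can be sketched with the standard library alone. The mapping of the six standard deviations to $(t_x, t_y, t_z, t_\theta, t_\alpha, t_\beta)$ is inferred from their units and is an assumption, as are the function names and the example ground truth values.

```python
import random

# Standard deviations in the assumed order (t_x, t_y, t_z, t_theta, t_alpha, t_beta);
# translations in mm, rotations in degrees.
PERTURBATION_STD = (1.0, 1.0, 10.0, 2.0, 10.0, 10.0)


def perturb(t_gt, rng):
    """Add zero-mean Gaussian noise to each ground truth parameter."""
    return tuple(t + rng.gauss(0.0, s) for t, s in zip(t_gt, PERTURBATION_STD))


def starting_positions(t_gt, count=140, seed=0):
    """Generate `count` perturbed starting positions for one X-ray image."""
    rng = random.Random(seed)
    return [perturb(t_gt, rng) for _ in range(count)]


if __name__ == "__main__":
    t_gt = (0.0, 0.0, 600.0, 0.0, 90.0, 0.0)  # hypothetical ground truth
    starts = starting_positions(t_gt)
    print(len(starts), starts[0])
```

With 7 X-ray images and 140 perturbations each, this yields 980 registration tasks in total.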

We compared the proposed method in three-pass mode with three
state-of-the-art intensity-based 2-D/3-D registration methods.
Powell's method was adopted as the optimizer for all evaluated
intensity-based methods, as its advantage in 2-D/3-D registration over
other popular optimization methods has been reported in the
literature. We evaluated two popular similarity measures, Mutual
Information (MI) and GC, which have also been reported to be effective
in recent literature. We also merged the two methods using MI and GC
to form an improved intensity-based 2-D/3-D registration method for
comparison. The combined method, referred to as MI+GC, first applies
MI to bring the registration into the capture range of GC, and then
applies GC to refine the registration.

The experiments were conducted on a workstation with an Intel Core
i7-4790K CPU, 16 GB RAM and an Nvidia GeForce GTX 980 GPU. For
intensity-based methods, the most computationally intensive component,
the DRR renderer, was implemented using the Ray Casting algorithm with
GPU acceleration. Similarity measures were implemented and executed on
a single CPU core. Both the DRR and the similarity measure were only
calculated within a 512×512 ROI surrounding the target object, for
better computational efficiency. For the proposed method, the neural
network was implemented with GPU acceleration using an open-source
deep learning framework, Caffe.

4.2. Results 

The registration accuracy was assessed with the mean Target
Registration Error in the projection direction (mTREproj), calculated
at the 8 corners of the bounding box of the target object. We regard
an mTREproj of less than 1% of the size of the target object (i.e.,
the diagonal of the bounding box), which is equivalent to 0.61 mm, as
a successful registration. For each evaluated method, we report its
success rate, the mean mTREproj of successful registrations and the
running time per registration.

Table 1: Quantitative experiment results including: 1) success rate,
2) mean mTREproj calculated among successful registrations, and 3)
average and standard deviation of running time per registration.

Method     Success Rate   Mean mTREproj   Running Time
MI         75.1%          0.315 mm        1.66 ± 0.60 s
GC         78.7%          0.285 mm        3.91 ± 1.55 s
MI+GC      92.7%          0.260 mm        4.71 ± 1.59 s
Proposed   92.3%          0.282 mm        0.08 ± 0.00 s
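The success criterion and the two accuracy statistics reported in Table 1 can be expressed compactly. The sketch below takes per-registration mTREproj values as given (the error computation itself is not shown), and the function names and sample error values are hypothetical.

```python
import math


def success_threshold(diagonal_mm, fraction=0.01):
    """Threshold as a fraction (1%) of the bounding-box diagonal."""
    return fraction * diagonal_mm


def summarize(errors_mm, threshold_mm):
    """Return (success rate, mean error among successful registrations)."""
    successes = [e for e in errors_mm if e < threshold_mm]
    rate = len(successes) / len(errors_mm)
    mean_error = sum(successes) / len(successes) if successes else math.nan
    return rate, mean_error


if __name__ == "__main__":
    # A 0.61 mm threshold at 1% implies a bounding-box diagonal of about 61 mm.
    threshold = success_threshold(61.0)
    rate, mean_error = summarize([0.2, 0.4, 0.9, 2.0], threshold)
    print(threshold, rate, mean_error)
```

Note that the mean error is computed over successful registrations only, so it should always be read together with the success rate.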

Table 1 summarizes the experiment results. Both MI and GC resulted in
relatively low success rates (75.1% and 78.7%), because of the low
accuracy of MI and the small capture range of GC. By combining the
advantages of MI and GC, MI+GC achieved a much higher success rate
(92.7%) and a very low mTREproj (0.260 mm), suggesting that it
achieves both high robustness and accuracy. In comparison, the
proposed method achieved a comparable success rate (92.3%) with a
slightly higher but still similar mTREproj (0.282 mm). Considering
that the ground truth parameters were generated using GC, which could
introduce a slight bias toward intensity-based methods using GC as the
similarity measure, the small differences in success rate and
mTREproj between the proposed method and MI+GC indicate that they
achieved comparable robustness and accuracy.

In terms of speed, the 3 intensity-based methods, MI, GC and MI+GC,
are in general not fast enough for real-time registration. The fastest
one, MI, took on average 1.66 s to accomplish 2-D/3-D registration,
while the most accurate one, MI+GC, took on average 4.71 s. In
comparison, the proposed method was significantly faster (0.08 s),
demonstrating its significant advantage in computational efficiency.
In addition, the running time of intensity-based methods has
relatively large standard deviations, because the number of iterations
involved in the optimization can vary for each registration depending
on the starting position. In comparison, the standard deviation of the
computation time for the proposed method is almost zero, showing that
it can provide real-time registration at a constant frame rate.
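The mean running times in Table 1 translate directly into speed-up factors and a registration rate. The short sketch below performs that arithmetic; the function names are illustrative.

```python
# Mean running time per registration from Table 1, in seconds.
RUNNING_TIME_S = {"MI": 1.66, "GC": 3.91, "MI+GC": 4.71, "Proposed": 0.08}


def speedup(baseline, method="Proposed"):
    """Ratio of the baseline's mean running time to that of `method`."""
    return RUNNING_TIME_S[baseline] / RUNNING_TIME_S[method]


def registrations_per_second(method="Proposed"):
    """Number of registrations per second implied by the mean running time."""
    return 1.0 / RUNNING_TIME_S[method]


if __name__ == "__main__":
    for baseline in ("MI", "GC", "MI+GC"):
        print(baseline, round(speedup(baseline), 1))
    print(registrations_per_second())
```

The proposed method is thus roughly 21, 49 and 59 times faster than MI, GC and MI+GC, respectively, and a mean running time of 0.08 s corresponds to about 12.5 registrations per second.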

5. CONCLUSION 

In this paper, we presented a real-time 2-D/3-D registration approach
based on CNN regression. We showed that 2-D/3-D registration can be
efficiently solved by training CNN regressors to reveal the mapping
from image residual to transformation parameter residual. We also
validated via experiments that the proposed method achieved
significantly higher computational efficiency than intensity-based
methods, with negligible degradation of registration accuracy.